bridge: add guest-side reconnect loop for live migration #2698
shreyanshjain7174 wants to merge 1 commit into microsoft:main
Conversation
```go
}

const commandPort uint32 = 0x40000000

// Reconnect loop: on each iteration we create a fresh bridge+mux, dial the
```
In general, an exponential backoff is the right answer. But in this case, the VM is frozen in time and only wakes up when the host shim is ready, so the connection should be immediately available. I think I'd rather see this as a very tight loop, personally.
Agreed — the VM is frozen and wakes up with the host ready, so the vsock should be available right away. I'll switch to a tight fixed-interval retry (e.g. 100ms) instead of exponential backoff.
```go
logrus.Info("bridge connected, serving")
bo.Reset()

serveErr := b.ListenAndServe(bridgeIn, bridgeOut)
```
Why can't you just reset `isQuitPending` and call `ListenAndServe` again? Wouldn't that "just work"?
It almost works, but there's a subtle issue with handler goroutines. The handler dispatch at line 356 spawns `go func(r *Request) { ... b.responseChan <- br }(req)` — this goroutine captures `b` and sends to `b.responseChan`, which is a struct field. If a handler is still in-flight when `ListenAndServe` returns (say a slow `CreateContainer` or `ExecProcess`), and we call `ListenAndServe` again on the same bridge, the new call overwrites `b.responseChan = make(chan ...)` while the old handler is about to send to it. That's a data race on the struct field — the old goroutine reads `b.responseChan` concurrently with the new `ListenAndServe` writing it.
In practice this window is very small (handlers finish fast), so it wouldn't show up in normal LM testing. But under load — say a `CreateContainer` request arrives right as the vsock drops during migration — the handler goroutine could be mid-flight when we re-enter `ListenAndServe`.
Recreating the `Bridge` means the old handlers hold a reference to the old (now-dead) bridge with its own channels, and the new bridge has completely separate state. No shared mutable field.
That said, if you think the simplicity of reuse outweighs this edge case, we could make it work by not closing `responseChan` in the defers and adding a short drain period before re-entering. Happy to go either way.
During live migration the vsock connection between the host and the GCS breaks when the VM moves to the destination node. The GCS bridge drops and cannot recover, leaving the guest unable to communicate with the new host.

This adds a reconnect loop in `cmd/gcs/main.go` that re-dials the bridge after a connection loss. On each iteration a fresh `Bridge` and `Mux` are created while the `Host` state (containers, processes) persists across reconnections.

A `Publisher` abstraction is added to `bridge/publisher.go` so that container wait goroutines spawned during `CreateContainer` can route exit notifications through the current bridge. When the bridge is down between reconnect iterations, notifications are dropped with a warning — the host-side shim re-queries container state after reconnecting.

The defer ordering in `ListenAndServe` is fixed so that `quitChan` closes before `responseChan` becomes invalid, and `responseChan` is buffered to prevent `PublishNotification` from panicking on a dead bridge.

Tested with `Invoke-FullLmTestCycle` on a two-node Hyper-V live migration setup (Node_1 -> Node_2). Migration completes at 100% and container exec works on the destination node after migration.

Signed-off-by: Shreyansh Sancheti <shsancheti@microsoft.com>
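The teardown ordering in the commit message can be sketched as below. Closing `quitChan` first lets `PublishNotification` check it before touching `responseChan`, so publishing on a dead bridge becomes a safe no-op instead of a panic. Types and signatures are simplified stand-ins, not the real bridge code.

```go
package main

import "fmt"

// Bridge is a cut-down stand-in with only the two channels at issue.
type Bridge struct {
	quitChan     chan struct{}
	responseChan chan string
}

// PublishNotification drops the notification if the bridge has quit;
// otherwise it sends on the buffered response channel.
func (b *Bridge) PublishNotification(n string) (published bool) {
	// Priority guard: a closed quitChan wins before we touch responseChan.
	select {
	case <-b.quitChan:
		return false
	default:
	}
	select {
	case b.responseChan <- n: // buffered, absorbs in-flight sends
		return true
	case <-b.quitChan: // quit raced in after the guard; still safe
		return false
	}
}

func main() {
	b := &Bridge{
		quitChan:     make(chan struct{}),
		responseChan: make(chan string, 8), // buffered per the fix
	}
	fmt.Println("live bridge published:", b.PublishNotification("exit"))

	// Teardown closes quitChan before responseChan becomes invalid.
	close(b.quitChan)
	fmt.Println("dead bridge published:", b.PublishNotification("exit"))
}
```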
Force-pushed from dbc66f1 to 05c7170.
Fixes #2669
Problem
During live migration the vsock connection between the host and the GCS (Guest Compute Service) breaks when the UVM moves to the destination node. The bridge inside the GCS drops and cannot recover — `ListenAndServe` returns with an I/O error, and the GCS has no way to re-establish communication with the new host. This leaves the guest unable to process any further container lifecycle operations after migration.

What this does
Adds a reconnect loop around the bridge lifecycle in `cmd/gcs/main.go`. When the bridge connection drops (detected by `ListenAndServe` returning), the GCS re-dials the host on the vsock command port and creates a fresh `Bridge` + `Mux`. The `Host` state (containers, processes, cgroups) persists across reconnections since it lives outside the bridge.
A `Publisher` is added to `internal/guest/bridge/publisher.go` to solve the goroutine lifetime mismatch: container wait goroutines are spawned during `CreateContainer` and outlive the bridge that created them. When a container exits, its wait goroutine calls `Publisher.Publish()`, which routes the notification through whichever bridge is currently active. If no bridge is connected (during the reconnect gap), the notification is dropped — the host-side shim recovers by re-querying container state after reconnecting.
The defer ordering in `ListenAndServe` is fixed so `quitChan` closes before `responseChan` becomes invalid, preventing a panic when `PublishNotification` races with bridge teardown. `responseChan` is buffered to absorb in-flight responses during shutdown.

Design
The approach follows the existing `runWithRestartMonitor` pattern already used in `cmd/gcs/main.go` for chronyd — a loop with exponential backoff that retries forever.

Key design decisions:
- A fresh `Bridge` + `Mux` is created per iteration. Channels and handler closures are scoped to each `ListenAndServe` call, so there is no stale state to reset.
- `hcsv2.Host` holds containers in mutex-guarded maps. It is created once and reused across all bridge iterations. Container state survives the bridge drop.
- Dropped exit notifications are safe: the host-side shim calls `WaitForProcess`, which blocks on the container's actual exit status — the notification is a convenience, not the source of truth.
- On a clean shutdown, `ShutdownRequested()` returns true and the loop breaks instead of reconnecting.

Reference: Kevin Parsons' live migration POC for the reconnect concept. This implementation simplifies the POC down to the minimum — just the main loop + Publisher (~90 lines of new code).
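The loop-plus-persistent-`Host` shape these decisions describe might look roughly like this sketch, with `host`, `bridge`, and `listenAndServe` as simplified stand-ins for the real `hcsv2.Host` and `Bridge` types, and a fake error simulating the vsock drop:

```go
package main

import (
	"errors"
	"fmt"
)

// host persists across reconnects; it owns container state and the
// shutdown flag (simplified stand-ins for hcsv2.Host fields).
type host struct {
	containers map[string]bool
	shutdown   bool
}

// bridge is rebuilt on every iteration of the loop.
type bridge struct{ h *host }

// listenAndServe returns an error when the connection drops, or nil on a
// clean shutdown. Here the first connection "drops" and the second one
// triggers shutdown, to exercise both loop exits.
func (b *bridge) listenAndServe(conn string) error {
	b.h.containers[conn] = true // state lands in the shared host
	if len(b.h.containers) >= 2 {
		b.h.shutdown = true
		return nil
	}
	return errors.New("vsock reset") // simulated migration drop
}

func main() {
	h := &host{containers: map[string]bool{}} // created once
	for i := 0; ; i++ {
		conn := fmt.Sprintf("conn-%d", i) // re-dial the host
		b := &bridge{h: h}                // fresh bridge per iteration
		err := b.listenAndServe(conn)
		if h.shutdown {
			break // clean shutdown: do not reconnect
		}
		fmt.Println("bridge dropped, reconnecting:", err)
	}
	fmt.Println("containers survived reconnects:", len(h.containers))
}
```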
Changes
- `cmd/gcs/main.go` — `for {}` loop with exponential backoff (1s–30s, retry forever)
- `internal/guest/bridge/bridge.go` — `Publisher` field on `Bridge`, `ShutdownRequested()` method, fixed defer ordering, buffered `responseChan`, priority select guard in `PublishNotification`
- `internal/guest/bridge/bridge_v2.go` — exit notifications routed through `Publisher.Publish()`
- `internal/guest/bridge/publisher.go` — new
- `internal/guest/bridge/publisher_test.go` — new

Testing
Tested on a two-node Hyper-V live migration setup using the `TwoNodeInfra` test module:

- `Invoke-FullLmTestCycle -Verbose` — deploys LM agents, creates a UVM with an LCOW container on Node_1, migrates to Node_2, verifies 100% completion on both nodes. Container `lcow-test` successfully migrated with pod sandbox intact.
- `crictl exec` — created a fresh LCOW pod on Node_1 with our custom GCS (deployed via `rootfs.vhd`), started a container that writes a file, exec'd `cat /tmp/dummy.txt` to verify bridge communication works end-to-end.
- `go build`, `go vet`, and `gofmt` clean on all modified packages.
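For reference, the `Publisher` indirection described under "What this does" can be sketched roughly like this (simplified, hypothetical types; the real implementation lives in `internal/guest/bridge/publisher.go`):

```go
package main

import (
	"fmt"
	"sync"
)

// Publisher is the long-lived object wait goroutines hold. It forwards to
// whichever bridge is currently attached, or drops the notification with a
// warning when no bridge is connected (the reconnect gap).
type Publisher struct {
	mu     sync.Mutex
	target func(string) // nil while no bridge is connected
}

// SetBridge attaches the current bridge's publish function (or nil).
func (p *Publisher) SetBridge(target func(string)) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.target = target
}

// Publish routes a notification through the active bridge, if any.
func (p *Publisher) Publish(n string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.target == nil {
		fmt.Println("warning: no bridge connected, dropping notification:", n)
		return
	}
	p.target(n)
}

func main() {
	p := &Publisher{}
	p.Publish("container exited") // dropped: we are in the reconnect gap

	var delivered []string
	p.SetBridge(func(n string) { delivered = append(delivered, n) })
	p.Publish("container exited") // routed through the current bridge
	fmt.Println("delivered:", delivered)
}
```

The host-side shim tolerates the dropped notification because it re-queries container state after reconnecting, per the description above.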